Capitalization and punctuation restoration: a survey

Abstract

Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often wrong punctuation and casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges and research directions are highlighted.
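
As an illustration of the task the abstract describes, the following minimal sketch shows one common way to frame it: each lowercased token carries a casing label and a label for the punctuation mark that follows it, so restoration amounts to predicting those labels for a raw token stream. The label names (`CAP`, `UPPER`, `COMMA`, and so on) and the helper functions `to_labels` and `restore` are assumptions made for this example and are not taken from the survey; no trained model is involved, only the data representation.

```python
import re

# Hypothetical label inventory for this sketch (not taken from the survey).
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
MARK = {label: mark for mark, label in PUNCT.items()}


def to_labels(text):
    """Turn cased, punctuated text into (token, case_label, punct_label) triples."""
    triples = []
    # Each match is a word plus the optional punctuation mark attached to it.
    for word, mark in re.findall(r"([^\s,.?]+)([,.?]?)", text):
        if word.isupper() and len(word) > 1:
            case = "UPPER"   # e.g. acronyms
        elif word[0].isupper():
            case = "CAP"     # first letter capitalized
        else:
            case = "LOWER"   # mixed-case words such as "iPhone" are lost here
        triples.append((word.lower(), case, PUNCT.get(mark, "NONE")))
    return triples


def restore(triples):
    """Rebuild text from triples, as a system would after predicting the labels."""
    words = []
    for token, case, punct in triples:
        if case == "UPPER":
            word = token.upper()
        elif case == "CAP":
            word = token[0].upper() + token[1:]
        else:
            word = token
        words.append(word + MARK.get(punct, ""))
    return " ".join(words)


triples = to_labels("Hello world, this is NASA.")
print(triples)
print(restore(triples))
```

In this representation the tokens alone (`hello world this is nasa`) resemble raw speech recognition output, and the two label columns are what a restoration system would have to predict.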

Similar resources

Recovering Capitalization and Punctuation Marks on Speech Transcriptions

This work addresses two metadata annotation tasks involved in the production of rich transcripts: automatic capitalization and punctuation mark recovery. The main focus is broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, and results support the idea that generative approaches capture the structure of writte...

LSTM for punctuation restoration in speech transcripts

The output of automatic speech recognition systems is generally an unpunctuated stream of words which is hard to process for both humans and machines. We present a two-stage recurrent neural network based model using long short-term memory units to restore punctuation in speech transcripts. In the first stage, textual features are learned on a large text corpus. The second stage combines textua...

Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages

This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discrimi...

Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news

The following material presents a study about recovering punctuation marks, and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using: finite state transducers automatically built from language models; and maximum entropy models. Several resources were used, includi...

"Restoring Punctuation and Capitalization in Transcribed Speech"

[Patent front-page extract] (54) Generating Prosodic Contours for Synthesized Speech. (75) Inventors: Martin Jansche, New York, NY (US); Michael D. Riley, New York, NY (US); Andrew M. Rosenberg, Brooklyn, NY. References cited: 6,871,178 B2, 3/2005, Case et al.; 6,975,987 B1, 12/2005, Tenpaku et al.; 6,990,449 B2, 1/2006, Case; 6,990,450 B2, 1/2006, Case et al.; 7,035,791 B2, Chazan et al.; 7,062,439 B2, 6/2006, Brittan et al.; 7,076,426 B1, 7/2006, Beutn...

Journal

Journal title: Artificial Intelligence Review

Year: 2021

ISSN: 0269-2821, 1573-7462

DOI: https://doi.org/10.1007/s10462-021-10051-x